THE NON-BAYESIAN RESTLESS MULTI-ARMED BANDIT: A CASE OF
NEAR-LOGARITHMIC REGRET



Wenhan Dai†‡, Yi Gai‡, Bhaskar Krishnamachari‡, Qing Zhao§

† School of Information Science and Technology, Tsinghua University, China
‡ Ming Hsieh Department of Electrical Engineering, University of Southern California, U.S.A.
§ Department of Electrical and Computer Engineering, University of California, Davis, U.S.A.



ABSTRACT 

In the classic Bayesian restless multi-armed bandit (RMAB) problem,
there are N arms, with rewards on all arms evolving at each
time as Markov chains with known parameters. A player seeks to
activate K ≥ 1 arms at each time in order to maximize the expected
total reward obtained over multiple plays. RMAB is a challenging 
problem that is known to be PSPACE-hard in general. We consider 
in this work the even harder non-Bayesian RMAB, in which the pa- 
rameters of the Markov chain are assumed to be unknown a priori. 
We develop an original approach to this problem that is applicable 
when the corresponding Bayesian problem has the structure that, de- 
pending on the known parameter values, the optimal solution is one 
of a prescribed finite set of policies. In such settings, we propose to 
learn the optimal policy for the non-Bayesian RMAB by employing 
a suitable meta-policy which treats each policy from this finite set 
as an arm in a different non-Bayesian multi-armed bandit problem 
for which a single-arm selection policy is optimal. We demonstrate 
this approach by developing a novel sensing policy for opportunistic 
spectrum access over unknown dynamic channels. We prove that our 
policy achieves near-logarithmic regret (the difference in expected 
reward compared to a model-aware genie), which leads to the same 
average reward that can be achieved by the optimal policy under a 
known model. This is the first such result in the literature for a non- 
Bayesian RMAB. 

Index Terms — restless bandit, regret, opportunistic spectrum 
access, learning, non-Bayesian 

1. INTRODUCTION 

Multi-armed bandit (MAB) problems are fundamental tools for 
optimal decision making in dynamic, uncertain environments. In 
a multi-armed bandit problem, there are N arms each generating
stochastic rewards, and a player seeks a policy to activate K ≥ 1
arms at each time in order to maximize the expected total reward
obtained over multiple plays. MAB problems can be broadly classified
as Bayesian (if the player knows the statistical model/parameters
of the reward process for each arm) or non-Bayesian (if the model 
for the reward process is a priori unknown to the user). In the case 
of non-Bayesian MAB problems, the objective is to design an arm 
selection policy that minimizes regret, defined as the gap between 
the expected reward that can be achieved by a genie that knows the 
parameters, and that obtained by the given policy. It is desirable 
to have a regret that grows as slowly as possible over time (if the 
regret is sub-linear, the average regret per slot tends to zero over 
time, and the policy achieves the maximum average reward that can
be achieved under a known model). 

A particularly challenging variant of these problems is the restless
multi-armed bandit problem (RMAB) [1], in which the rewards
on all arms evolve at each time as Markov chains. Even in the
Bayesian case, where the parameters of the Markov chains are known,
this problem is difficult to solve, and has been proved to be
PSPACE-hard [2]. One approach to this problem has been Whittle's index,
which is asymptotically optimal under certain regimes; however, it
does not always exist, and even when it does, it is not easy to compute.
It is only in very recent work that non-trivial tractable classes
of RMAB where Whittle's index exists and is computable have been
identified [3].

We consider in this work the even harder non-Bayesian RMAB, 
in which the parameters of the Markov chain are further assumed 
to be unknown a priori. Our main contribution in this work is a 
novel approach to this problem that is applicable whenever the cor- 
responding Bayesian RMAB problem has the structure that the pa- 
rameter space can be partitioned into a finite number of sets, for each 
of which there is a single optimal policy. Our approach essentially 
develops a meta-policy that treats these policies as arms in a dif- 
ferent non-Bayesian multi-armed bandit problem for which a single 
arm selection policy is optimal for the genie, and tries to learn which 
policy from this finite set gives the best performance. 

We demonstrate our approach on a practical problem pertain- 
ing to dynamic spectrum sensing. In this problem, we consider a 
scenario where a secondary user must select one of N channels to
sense at each time to maximize its expected reward from transmis- 
sion opportunities. If the primary user occupancy on each channel 
is modeled to be an identical but independent Markov chain with 
unknown parameters, we obtain a non-Bayesian RMAB with the 
requisite structure. We develop an efficient new multi-channel cog- 
nitive sensing algorithm for unknown dynamic channels based on 
the above approach. We prove for N = 2, 3 that this algorithm
achieves regret (the gap between the expected optimal reward obtained
by a model-aware genie and that obtained by the given policy)
that is bounded uniformly over time n by a function that grows
as O(G(n) · log n), where G(n) can be any arbitrarily slowly diverging
non-decreasing sequence. This is the first non-Bayesian RMAB
policy that achieves the maximum average reward defined by the op- 
timal policy under a known model. 

There are two parallel investigations on non-Bayesian RMAB 
problems given in [8], [9], where a more general RMAB model is
considered but under a much weaker definition of regret. Specifically,
in [8], [9], regret is defined with respect to the maximum reward
that can be offered by a single arm/channel. Note that for RMAB 
with a known model, staying with the best arm is suboptimal. Thus, 
a sublinear regret under this definition does not imply the maximum 
average reward, and the deviation from the maximum average re- 
ward can be arbitrarily large. 



2. A NEW APPROACH FOR NON-BAYESIAN RMAB 

We first describe a structured class of finite-option Bayesian RMAB
problems that we will refer to as $\Psi_m$. Let $B(P)$ be a Bayesian
RMAB problem with the Markovian evolution of arms described by
the transition matrix $P$. We say that $B(P) \in \Psi_m$ if and only if there
exists a partition of the parameter values $P$ into a finite number
$m$ of sets $\{S_1, S_2, \ldots, S_m\}$ and a set of policies $\pi_i$ ($\forall i = 1, \ldots, m$) that
do not assume knowledge of $P$ and are optimal whenever $P \in S_i$.
Despite the general hardness of the RMAB problem, problems with
such structure do indeed exist, as has been shown in [3], [4], [5].

We propose a solution to the non-Bayesian version of the problem
that leverages the finite solution option structure when
the corresponding Bayesian version satisfies $B(P) \in \Psi_m$. In this case,
although the player does not know the exact parameter $P$, it must
be true that one of the $m$ policies $\pi_i$ will yield the highest expected
reward (corresponding to the set $S_i$ that contains the true, unknown
P). These policies can thus be treated as arms in a different non- 
Bayesian multi-armed bandit problem for which a single-arm selec- 
tion policy is optimal for the genie. Then, a suitable meta-policy that 
sequentially operates these policies while trying to minimize regret 
can be adopted. This can be done with an algorithm based on the 
well-known schemes proposed by Lai and Robbins [6], and Auer et
al. [7].

One subtle issue that must be handled in adopting such an al- 
gorithm as a meta-policy is how long to play each policy. An ideal 
constant length of play could be determined only with knowledge of 
the underlying unknown parameters P, so our approach is to have 
the duration for which each policy is operated slowly increase over 
time. 

In the following, we demonstrate this novel meta-policy approach
on the dynamic spectrum access problem discussed in [4], [5],
where the Bayesian version of the RMAB has been shown to belong
to the class $\Psi_2$. For this problem, we show that our approach yields
an algorithm with provably near-logarithmic regret, thus achieving 
the same average reward offered by the optimal RMAB policy under 
a known model. 



3. A DYNAMIC SPECTRUM ACCESS PROBLEM 

We consider a slotted system where a secondary user is trying to
access N independent channels, with the availability of each channel
evolving as a two-state Markov chain with identical transition matrix
P that is a priori unknown to the user. The user can only see the
state of the sensed channel. If the user selects channel i at time
t, and upon sensing finds the state of the channel $S_i(t)$ to be 1, it
receives a unit reward for transmitting. If it instead finds the channel
to be busy, i.e., $S_i(t) = 0$, it gets no reward at that time. The user
aims to maximize its expected total reward (throughput) over some
time horizon by choosing judiciously a sensing policy that governs
the channel selection in each slot. We are interested in designing
policies that perform well with respect to regret, which is defined as
the difference between the expected reward that could be obtained
using the omniscient policy $\pi^*$ that knows the transition matrix P,
and that obtained by the given policy $\pi$. The regret at time n can be
expressed as:

$$R(P, \Omega(1), n) = E^{\pi^*}\Big[\sum_{t=1}^{n} Y^{\pi^*}(P, \Omega(1), t)\Big] - E^{\pi}\Big[\sum_{t=1}^{n} Y^{\pi}(P, \Omega(1), t)\Big], \qquad (1)$$

where $\omega_i(1)$ is the initial probability that $S_i(1) = 1$, P is the
transition matrix of each channel, $Y^{\pi^*}(P, \Omega(1), t)$ is the reward
obtained in time t with the optimal policy, and $Y^{\pi}(P, \Omega(1), t)$
is the reward obtained in time t with the given policy. We denote
by $\Omega(t) = [\omega_1(t), \ldots, \omega_N(t)]$ the belief vector, where $\omega_i(t)$
is the conditional probability that $S_i(t) = 1$ (and let $\Omega(1) =
[\omega_1(1), \ldots, \omega_N(1)]$ denote the initial belief vector used in the myopic
sensing algorithm [4]).
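The channel model of this section can be simulated in a few lines. The following is a minimal sketch, not the authors' code: the helper names `step_channels` and `run_policy`, the callback signature `choose(t, channel, obs)`, and the uniformly random initial states are all assumptions made for illustration. Here `p01` and `p11` denote the probabilities of moving to state 1 from state 0 and from state 1, respectively.

```python
import random

def step_channels(states, p01, p11, rng):
    """Advance every channel by one slot (p01 = P(0->1), p11 = P(1->1))."""
    return [1 if rng.random() < (p11 if s == 1 else p01) else 0 for s in states]

def run_policy(choose, n_channels, p01, p11, horizon, seed=0):
    """Total reward collected by a sensing rule over `horizon` slots.

    choose(t, channel, obs) returns the channel to sense in slot t, given the
    previously sensed channel and its observed state (obs is None in slot 0).
    """
    rng = random.Random(seed)
    # Assumption: initial states are drawn uniformly at random.
    states = [rng.randint(0, 1) for _ in range(n_channels)]
    total, channel, obs = 0, 0, None
    for t in range(horizon):
        channel = choose(t, channel, obs)
        obs = states[channel]          # only the sensed channel is observed
        total += obs                   # unit reward when the channel is idle
        states = step_channels(states, p01, p11, rng)
    return total
```

An empirical regret estimate could then be formed by averaging the difference between `run_policy` totals for a model-aware rule and for a learning rule over many seeds.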

4. SENSING UNKNOWN DYNAMIC CHANNELS 

As has been shown in [4], the myopic policy has a simple structure
for switching between channels that depends only on the correlation
sign of the transition matrix P, i.e., whether $p_{11} > p_{01}$ (positively
correlated) or $p_{11} < p_{01}$ (negatively correlated).

In particular, if the channel is positively correlated, then the my- 
opic policy corresponds to 

• Policy $\pi_1$: stay on a channel whenever it shows a "1" and
switch on a "0" to the channel visited the longest ago. 

If the channel is negatively correlated, then it corresponds to 

• Policy $\pi_2$: stay on a channel when it shows a "0", and
switch as soon as a "1" is observed, to either the channel
most recently visited among those visited an even number of
steps before, or if there are no such channels, to the one visited
the longest ago.
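The two switching rules above can be written as small selection functions. This is a sketch under stated assumptions rather than the reference implementation of [4]: `last_visit[c]` is assumed to hold the slot in which channel `c` was last sensed, the sentinel `NEVER` marks channels not yet sensed (treated here as visited the longest ago), `t` is the index of the upcoming slot, and ties are broken by lowest channel index.

```python
NEVER = -1  # assumed sentinel: channel has not been sensed yet

def pi1_next(current, obs, last_visit):
    """Policy pi_1: stay on a "1"; on a "0" move to the channel visited longest ago."""
    if obs == 1:
        return current
    others = [c for c in range(len(last_visit)) if c != current]
    return min(others, key=lambda c: last_visit[c])

def pi2_next(current, obs, last_visit, t):
    """Policy pi_2: stay on a "0"; on a "1" prefer the most recently visited
    channel among those visited an even number of slots before slot t,
    otherwise fall back to the channel visited longest ago."""
    if obs == 0:
        return current
    others = [c for c in range(len(last_visit)) if c != current]
    even = [c for c in others
            if last_visit[c] != NEVER and (t - last_visit[c]) % 2 == 0]
    if even:
        return max(even, key=lambda c: last_visit[c])
    return min(others, key=lambda c: last_visit[c])
```

For N = 2 both rules reduce to "stay or switch to the other channel", which is why only the correlation sign matters.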

Furthermore, it has been shown in [4], [5] that the myopic policy
is optimal for N = 2, 3, and for any N in the case of positively
correlated channels (the optimality of the myopic policy for N > 3
negatively correlated channels is conjectured for the infinite-horizon
case). As a consequence, this special class of RMAB has the required
finite dependence on its model as described in Sec. 2; specifically,
it belongs to $\Psi_2$. We can thus apply the general approach
based on the concept of meta-policy. Specifically, the algorithm 
treats these two policies as arms in a classic non-Bayesian multi- 
armed bandit problem, with the goal of learning which one gives the 
higher reward. 

A key question is how long to operate each arm at each step. It 
turns out from the analysis we present in the next section that it is 
desirable to slowly increase the duration of each step using any (arbitrarily
slowly) divergent non-decreasing sequence of positive integers
$\{K_n\}_{n=1}^{\infty}$.

The channel sensing policy we thus construct is shown in Algorithm 1.

5. REGRET ANALYSIS 

We first define the discrete function G(n), which represents the value
of $K_i$ at the n-th time step in Algorithm 1:

$$G(n) = \min_{I} K_I \quad \text{s.t.} \quad \sum_{i=1}^{I} K_i \ge n. \qquad (2)$$

Note that since $K_i$ can be any arbitrarily slowly diverging non-decreasing
sequence, G(n) can also grow arbitrarily slowly.
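As an illustration of the definition of G(n), the sketch below evaluates it for a block-length sequence supplied as a function. The function name `G` mirrors the text, while the example sequence `K(i) = i.bit_length()` (roughly log2 of i) is only an assumed instance of a slowly diverging non-decreasing sequence of positive integers.

```python
from itertools import count

def G(n, K):
    """Return K(I) for the smallest I with K(1) + ... + K(I) >= n (n >= 1)."""
    total = 0
    for I in count(1):
        total += K(I)
        if total >= n:
            return K(I)

# Assumed example sequence: 1, 2, 2, 3, 3, 3, 3, 4, ...
K = lambda i: i.bit_length()
```

With this choice the cumulative block lengths are 1, 3, 5, 8, 11, ..., so slot 6 falls in the fourth block and G(6) equals K(4) = 3.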

The following theorem states that the regret of our algorithm 
grows close to logarithmically with time. 



Algorithm 1 Sensing Policy for Unknown Dynamic Channels
1: // Initialization
2: Play policy $\pi_1$ for $K_1$ times, denote $A_1$ as the sample mean of these $K_1$ rewards
3: Play policy $\pi_2$ for $K_2$ times, denote $A_2$ as the sample mean of these $K_2$ rewards
4: $X_1 = A_1$, $X_2 = A_2$
5: $n = K_1 + K_2$
6: $i = 3$, $i_1 = 1$, $i_2 = 1$
7: // Main loop
8: while 1 do
9:    Find $j$ such that $j = \arg\max_j \, X_j / i_j + \sqrt{3 \ln n / i_j}$
10:   $i_j = i_j + 1$
11:   Play policy $\pi_j$ for $K_i$ times, let $A_j(i_j)$ record the sample mean of these $K_i$ rewards
12:   $X_j = X_j + A_j(i_j)$
13:   $i = i + 1$
14:   $n = n + K_i$
15: end while
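The structure of Algorithm 1 can be sketched as a short routine. This is an illustrative reading, not the authors' implementation: the callback `play(j, k)`, which is assumed to operate policy `j` for `k` slots and return the observed rewards, the block-length function `K(i)`, the stopping rule based on `horizon`, and the parameter `explore` (the constant multiplying ln n under the square root, defaulted to 3.0 here as an assumption) are all hypothetical names introduced for this example.

```python
import math

def meta_policy(play, K, horizon, explore=3.0):
    """Index-based selection between two policies with growing block lengths.

    play(j, k): rewards (list of k numbers) from running policy j for k slots.
    K(i):       length of the i-th block (1-indexed, positive integer).
    Returns the number of blocks assigned to each policy.
    """
    X = [0.0, 0.0]   # running sums of per-block sample means
    plays = [0, 0]   # number of blocks played with each policy
    n = 0            # slots elapsed so far
    i = 1

    def run_block(j):
        nonlocal n, i
        rewards = play(j, K(i))
        X[j] += sum(rewards) / len(rewards)
        plays[j] += 1
        n += K(i)
        i += 1

    # Initialization: one block for each policy.
    run_block(0)
    run_block(1)
    # Main loop: pick the policy with the larger index.
    while n < horizon:
        index = [X[j] / plays[j] + math.sqrt(explore * math.log(n) / plays[j])
                 for j in (0, 1)]
        run_block(0 if index[0] >= index[1] else 1)
    return plays
```

With a toy `play` in which policy 0 always earns 1 and policy 1 always earns 0, the routine spends only a logarithmic number of blocks on policy 1, which is the behaviour the regret analysis quantifies.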



Theorem 1 For the dynamic spectrum access problem with N =
2, 3 i.i.d. channels with unknown transition matrix P, the expected
regret with Algorithm 1 after n time steps is at most $Z_1 G(n) \ln(n) +
Z_2 \ln(n) + Z_3 G(n) + Z_4$, where $Z_1, Z_2, Z_3, Z_4$ are constants that
depend only on P.
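To see how the bound of Theorem 1 relates to the average reward, one can divide it by n. The step below is a sketch that assumes the block lengths are chosen so that $G(n) \ln n = o(n)$ (for example $G(n) \le \ln n$); this growth condition is an assumption on the designer's choice of sequence, not part of the theorem statement.

```latex
\frac{R(P,\Omega(1),n)}{n}
  \;\le\; \frac{Z_1 G(n)\ln n + Z_2 \ln n + Z_3 G(n) + Z_4}{n}
  \;\longrightarrow\; 0
  \quad (n \to \infty), \qquad \text{whenever } G(n)\ln n = o(n).
```

Under that assumption the time-averaged reward of the learning policy converges to that of the model-aware genie.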

The proof of Theorem 1, presented in the appendix, uses two
interesting lemmas we have developed that we present here without
proof. The first lemma is a non-trivial variant of the Chernoff-Hoeffding
bound that allows for bounded differences between the
conditional expectations of a sequence of random variables that are
revealed sequentially:

Lemma 1 Let $X_1, \ldots, X_n$ be random variables with range $[0, b]$
and such that $|E[X_t \mid X_1, \ldots, X_{t-1}] - \mu| \le C$, where C is a constant
such that $0 < C < \mu$. Let $S_n = X_1 + \cdots + X_n$. Then for
all $a \ge 0$,

$$P\{S_n \ge n(\mu + C) + a\} \le e^{-2(a/b)^2/n} \qquad (3)$$
$$P\{S_n \le n(\mu - C) - a\} \le e^{-2(a/b)^2/n} \qquad (4)$$

The second lemma states that the expected loss of reward for
either policy due to starting with an arbitrary initial belief vector,
compared to the reward $U_i(P)$ that would be obtained by playing the
policy at steady state, is bounded by a constant $C_i(P)$ that depends
only on the policy used and the transition matrix. These constants
can be calculated explicitly, but we omit the details for brevity.

Lemma 2 For any initial belief vector $\Omega(1)$ and any positive
integer L, if we use policy $\pi_i$ ($i = 1, 2$) for L times, and the
summed expectation of the rewards for these L steps is denoted as
$E^{\pi_i}[\sum_{t=1}^{L} Y^{\pi_i}(P, \Omega(1), t)]$, then

$$\Big| E^{\pi_i}\Big[\sum_{t=1}^{L} Y^{\pi_i}(P, \Omega(1), t)\Big] - L \cdot U_i(P) \Big| \le C_i(P) \qquad (5)$$
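A consequence of Lemma 2 that is useful when reading the appendix follows by dividing (5) by L: the expected sample mean of a block of L plays is within $C_i(P)/L$ of the steady-state reward. The display below is a sketch of that one-line step, using only the symbols of the lemma.

```latex
\left| \frac{1}{L}\, E^{\pi_i}\!\left[ \sum_{t=1}^{L} Y^{\pi_i}(P,\Omega(1),t) \right] - U_i(P) \right|
  \;\le\; \frac{C_i(P)}{L}.
```

Longer blocks therefore give per-block sample means whose bias vanishes, which is one reason for letting the block lengths grow.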

Remark: Theorem 1 has been stated for the cases N = 2, 3,
which are the only cases when the myopic policy has been proved
to be optimal for the known-parameter case for all values of P. In
fact, our proof shows something even stronger than this: that Algorithm 1
yields the claimed near-logarithmic regret with respect to
the myopic policy for any N. The myopic policy is known to be
always optimal for N = 2, 3, and for any N so long as the Markov
chain is positively correlated. In case it is negatively correlated, it is
an open question whether it is optimal for an infinite horizon case.
If this were to be true, the algorithm we have presented would also
offer near-logarithmic regret asymptotically as the time variable n
increases, for any N.

7. APPENDIX

Proof of Theorem 1

We first derive a bound on the regret for the case when $p_{01} <
p_{11}$. In this case, policy $\pi_1$ is optimal. By Lemma 2,
the difference between $E^{\pi_1}[\sum_{t=1}^{n} Y^{\pi_1}(P, \Omega(1), t)]$ and $U_1 \cdot n$ is no
more than $C_1$; therefore we only need to prove that $R'(P, \Omega(1), n)$, the
regret in the case when policy $\pi_1$ is optimal, which is defined as
$R'(P, \Omega(1), n) \triangleq U_1 \cdot n - E^{\pi}[\sum_{t=1}^{n} Y^{\pi}(P, \Omega(1), t)]$, is at most
$Z_1 G(n) \ln(n) + Z_2 \ln(n) + Z_3 G(n) + Z_4$, where $Z_1, Z_2, Z_3, Z_4$ are constants
that depend only on P.

The regret comes from two parts: the regret when using policy
$\pi_2$, and the gap between $U_1$ and $E^{\pi_1}[Y^{\pi_1}(P, \Omega(1), t)]$ when using
policy $\pi_1$. From Lemma 2, we know that each time we switch
from policy $\pi_2$ to policy $\pi_1$, we lose at most a constant-level value
from the second part. So if the number of selections of policy $\pi_2$
in Line 9 of Algorithm 1 is bounded by $O(\ln n)$, both parts of the
regret can be bounded by $O(G(n) \cdot \ln n)$.

For ease of exposition, we discuss the slots n such that $G \| n$,
where $G \| n$ denotes that time n is the end of successive G(n) plays.



We define q as the smallest index such that K q > \ 
Note that it is possible to define $\alpha(U_1, C_1, P)$ such that if policy $\pi_1$
is played $s_1 > \alpha$ times,
f/{s-q))<2t-



(6) 



We could also define $\beta(U_2, C_2, P)$ such that if policy $\pi_2$ is played
$s_2 > \beta$ times,



o ( U 2 -C 2 /K q ,tj ,C 3 __ 1 \, /31nt\2 

s-g 

Moreover, there exists $\gamma = \lceil \max\{5\alpha + 1, e^{4\alpha/3}, 5\beta + 1, e^{4\beta/3}\} \rceil$
such that when $G(n) > K_\gamma$, policy $\pi_1$ is played at least $\alpha$ times and
policy $\pi_2$ is played at least $\beta$ times.

Denote T(n) as the number of times we select policy $\pi_2$ up to
time n. Then, for any positive integer $l$, we have:



ForA(n) = [(3(1+ %±%% ) 2 lnn)/(Eft - 1/ 2 - 
i ll lb is false. So we get: 

$$E(T(n)) \le A(n) + \gamma + \sum_{t=1}^{\infty} \sum_{s_1=1}^{t} \sum_{s_2=1}^{t} 4t^{-4} \le A(n) + \gamma + \frac{2\pi^2}{3}. \qquad (14)$$
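The constant $2\pi^2/3$ in (14) can be checked directly. The short computation below is a sketch that assumes the inner sums range over $1 \le s_1, s_2 \le t$, so that each value of t contributes $t^2$ identical terms; here $\pi$ denotes the numerical constant, not a policy.

```latex
\sum_{t=1}^{\infty} \sum_{s_1=1}^{t} \sum_{s_2=1}^{t} 4t^{-4}
  \;=\; \sum_{t=1}^{\infty} t^{2} \cdot 4t^{-4}
  \;=\; 4 \sum_{t=1}^{\infty} \frac{1}{t^{2}}
  \;=\; 4 \cdot \frac{\pi^{2}}{6}
  \;=\; \frac{2\pi^{2}}{3}.
```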



Therefore, we have:

$$R'(P, \Omega(1), n) \le G(n) + \big((U_1 - U_2) G(n) + C_1\big)\Big(A(n) + \gamma + \frac{2\pi^2}{3}\Big). \qquad (15)$$

This concludes the bound in the case $p_{11} > p_{01}$. The derivation
of the bound is similar for the case when $p_{11} < p_{01}$, with the key
difference of $\gamma'$ instead of $\gamma$, and the $C_1$, $U_1$ terms being replaced
by $C_2$, $U_2$ and vice versa. Then we have that the regret in either case
has the following bound:
where $\mathbb{I}(x)$ is the indicator function defined to be 1 when the predicate
x is true, and 0 when it is false; $i_j(t)$ is the number of times
we select policy $\pi_j$ up to time t, $\forall j = 1, 2$; $X_j(t)$ is the sum
of every sample mean for $K_i$ plays up to time t; $X_{j,s_j}$ is the sum of
every sample mean for $s_j$ times using policy $\pi_j$.

The condition + < ^a. + y'^p} implies 

that at least one of the following must hold: 
$R(P, \Omega(1), n) \le G(n) + (|U_2 - U_1| G(n) + \max\{C_1, C_2\})$
<3 (l + m «{^±ggj > g±g|g f }) a ln» 
This inequality can be readily translated to the simplified form
of the bound given in the statement of Theorem 1, where: 
Note that $X_{1,s_1} = A_{1,1} + A_{1,2} + \cdots + A_{1,s_1}$, where $A_{1,i}$ is the
sample average reward for the $i$-th time selecting policy $\pi_1$. Due to
the definition of $\alpha$ and $K_q$, the expected value of $A_{1,i}$ is between
$U_1 - C_1/K_q$ and $U_1 + C_1/K_q$ if $i \ge q$ (Lemma 2). Then applying Lemma
1 and the results in [6] and [7],